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Abstract 

The  computational  complexity  for  parallel  implementation  of  multidomain 
spectral  methods  is  studied  to  derive  the  optimal  number  of  subdomains,  q,  and 
spectral  order,  n,  for  numerical  solution  of  hyperbolic  problems.  The  complex¬ 
ity  analysis  is  based  upon  theoretical  results  which  predict  error  as  a  function 
of  {q,  n)  for  problems  having  wave-like  solutions.  These  are  combined  with  a 
linear  communication  cost  model  to  study  the  impact  of  communication  over¬ 
head  and  imposed  granularity  on  the  optimal  choice  of  {q,  n)  as  a  function  of 
the  number  of  processors.  It  is  shown  that,  for  present  day  multicomputers, 
the  impact  of  communication  overhead  does  not  significantly  shift  (g,  n)  firom 
the  optimal  uni-processor  values,  and  that  the  effects  of  granularity  are  more 
important. 
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1  Introduction 

We  examine  the  computational  efficiency  for  parallel  solution  of  a  hyperbolic  partial 
differential  equation  in  d-space  dimensions  and  time  on  =  [-1,1]*^  x  [0,t].  We 
consider  a  class  of  model  problems  in  which  temporal  discretization  is  assumed  to 
be  based  upon  an  explicit  marching  procedure  and  spatial  discretization  is  based 
upon  a  spectral  subdomain  approach  with  Q  subdomains,  each  having  order  of  ap¬ 
proximation  n  in  each  spatial  direction.  The  choice  of  Q  and  n  is  determined  by 
the  dual  constraints  that  the  discretization  error  must  be  below  a  certain  level,  and 
that  the  computational  time  should  be  minimized.  This  optimization  problem  has 
been  previously  analyzed  for  the  uni-processor  case  [1].  Here,  we  consider  the  im¬ 
pact  of  non-uniform  memory  access  times  imposed  by  parallel  distributed  memory 
architectures. 

Examples  of  architecture  and  discretization  dependent  computational  cost  anal- 
vses  have  also  been  presented  in  [2,  3,  4,  5].  Keller  and  Schreiber  [2]  study  the  costs 
for  parameter  continuation  of  steady-state  Navier-Stokes  calculations  on  vector  super¬ 
computers  with  a  cost-functional  that  incorporates  both  CPU  and  memory  charges, 
and  with  the  principal  independent  parameter  being  the  frequency  of  full-Newton 
vs.  chord  continuation  steps.  Chan  and  Shao  [3]  study  the  question  of  the  optimal 
number  of  subdomains  for  the  solution  of  elliptic  problems  via  domain  decomposition 
based  iterative  solvers.  In  that  study,  a  subdomain  defines  the  support  of  local  block 
preconditioners  and  corresponding  coarse-grid  operators,  rather  than  redefining  the 
underlying  discretization.  Rqnquist  [4]  examines  the  trade-off  between  number  of 
subdomains  and  local  approximation  order  for  the  spectral  element  method  applied 
to  serial  solution  of  model  elliptic  problems.  Fischer  and  Patera  [5]  provide  both 
theoretical  and  experimental  performance  analysis  for  the  spectral  element  method 
on  hypercubes. 

In  the  present  study,  we  use  an  explicit  error  estimate  for  hyperbolic  problems 
having  wave-like  solutions  as  derived  by  Gottlieb  and  Wasberg  [1]  to  determine  the 
optimal  discretization  for  a  tensor-product  based  model  problem  on  distributed  mem¬ 
ory  architectures.  Although  fairly  simple,  this  model  embodies  many  essential  dis¬ 
cretization/architectural  details  which  must  be  considered  in  designing  an  algorithm 
for  modern  day  supercomputers.  These  include:  the  trade  off  between  high-  and 
low-order  accuracy  discretizations,  with  compensating  resolution  to  maintain  a  fixed 
accuracy;  the  cost  of  non-local  memory  accesses,  accounted  for  by  a  message-passing 
model;  and  the  imposition  of  a  fixed  granularity  due  to  the  fact  that  P  processors 
are  employed  in  the  simulation.  It  should  be  noted  that,  in  contrast  to  high-order 
finite  difference  schemes,  the  spectral  subdomain  approach  is  in  fact  a  heterogeneous 
discretization  in  the  sense  that  partitioning  the  domain  along  subdomain  boundaries 
induces  far  less  communication  than  would  arise  if  the  subdomains  themselves  were 
subdivided.  This  heterogeneity  is  well  suited  to  computer  architectures  exhibiting  a 
two-tiered  (local,  and  non-local)  memory  access  cost. 
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2  A  Hyperbolic  Model  Problem 

As  a  model  problem,  we  consider  the  d-dimensional  convection  equation  on  Q  = 
[-1,  X  [0,t]  given  by: 

^  +  V-{U<j>)  =  f{S,t)  (1) 

(j){x,  0)  =  ^o{x)  , 

where  the  unknown,  0,  is  a  scalar  convected  with  a  given  velocity  field,  U{x,t),  and 
/  is  a  known  forcing  function.  Appropriate  boundary  conditions  are  assumed  on  each 
face  of  the  domain  -  the  details  are  not  important  to  the  cost  analysis  considered  here. 
We  assume  that  (1)  is  discretized  in  time  via  an  explicit  time-marching  procedure 
requiring  only  local  vector  updates  and  no  system  solves  to  advance  the  solution  at 
each  step. 

Spatial  discretization  is  based  upon  a  spectral  subdomain  approach.  Let  Q  =  q'^ 
be  the  total  number  of  subdomains,  and  N  =  ho  the  number  of  points  within 
each  subdomain;  q  and  n  are  the  number  of  subdomains  and  points  (order  of  approx¬ 
imation)  in  each  spatial  direction,  respectively.  The  total  number  of  grid  points  is 
therefore  QN  =  {qnY-  The  d-dimensional  domain  is  constructed  as  a  tensor  prod¬ 
uct  of  one-dimensional  partitions  of  [-1, 1].  We  assume  a  uniform  partition  in  each 
direction,  yielding,  e.g.,  a.  q  x  q  array  of  (n  x  n)  squares  in  two  dimensions.  Note 
that  in  the  limit,  n  — 1,  we  recover  a  low-order  finite  difference  or  finite  element 
approximation.  In  the  other  extreme,  q  — )•  1,  we  recover  a  spectral  discretization. 

The  discretization  leads  to  an  evolution  equation  of  the  form: 


i-i 


<!>' 


fc+i 


i=0 


:tk—i 


Y-F  +  / 


A:— 2 


(2) 


Here,  underscore  denotes  vectors  of  basis  coefficients,  V.  is  the  discrete  divergence 
operator,  F  is  the  vector  of  fluxes:  {K}j  =  Uj4>ji  d  =  1)  -  -  •  >  Q^i  ^.nd  the  /3iS  denote 
the  coefficients  for  an  /th-order  time  stepping  scheme  (e.g.  [6]).  More  stable  Runge- 
Kutta  schemes  can  be  accommodated  without  significantly  altering  the  computational 
complexity. 

We  assume  that  within  each  subdomain,  derivatives  are  evaluated  via  application 
of  a  one-dimensional  derivative  matrix  in  a  tensor-product  fashion,  e.g.,  in  IR^, 


dx 


[F(g)/j,0F^]F 
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where  ly  and  F  are  the  {n  x  n)  identity  matrices  and  Dx  is  the  (n  x  n)  differentiation 
matrix  which  may  be  derived  from  a  collocation  method  or  high-order  finite  difference 
method.  In  spectral  subdomain  methods  the  derivatives  at  the  interfaces  are,  in 
effect,  evaluated  with  nonsymmetric  stencils.  To  maintain  continuity  of  the  solution, 
information  needs  to  be  propagated  across  subdomain  boundaries.  Stability  and  nth- 
order  accuracy  (i.e.,  exponential  with  n)  can  be  achieved  just  by  propagating  surface 
function  values  across  the  interface;  higher  derivatives  or  additional  stencil  values  do 
not  need  to  be  shared  across  the  interface  [7].  Consequently,  the  communication  cost 
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is  essentially  the  same  as  that  for  a  low-order  scheme,  i.e.,  proportional  to  the  surface 
area  of  the  subdomain,  and  not  to  the  volume. 

For  the  model  problem  (2)  the  spatial  truncation  error  is  governed  by  the  choice 
of  discretization  for  the  gradient  operator.  For  solutions  exhibiting  wave-like  behavior 
with  maximum  wave  number  k,  the  spectral  subdomain  discretization  leads  to  an  error 
estimate  of  the  form  [1]: 


error 


(eTrk\ 
2qn) 


(3) 


Two  possible  convergence  strategies  ensue.  One  can  increase  the  number  of  subdo¬ 
mains,  leading  to  algebraic  convergence  of  order  n,  or  increase  the  approximation 
order  within  each  subdomain,  leading  to  exponential  convergence. 

Throughout  the  remainder  of  the  paper,  we  assume  that  the  error  is  constrained 
to  a  user  specified  value,  i.e.,  we  require 


<  e"' 


(4) 


and  seek  to  minimize  computational  cost  (time)  subject  to  the  constraint  that  e  is 
fixed.  Note  that  for  fixed  error  e,  an  increase  in  approximation  order  (n)  implies  a 
decrease  in  the  number  of  subdomains  (g),  and  vice-versa.  This  result  derives  directly 
from  the  error  constraint  (4)  independent  of  the  subsequent  computational  complexity 
analysis.  Moreover,  for  e  fixed,  an  increase  in  n  also  implies  a  decrease  in  the  number 
of  grid  points,  QN,  though  not  necessarily  a  reduction  in  the  total  work. 


3  Computational  Complexity 

We  consider  the  implementation  of  the  model  problem  (2)  on  a  distributed  memory 
MIMD  parallel  architecture.  The  programming  model  is  based  upon  the  single  pro¬ 
gram,  multiple  data  (SPMD)  paradigm;  each  processor  is  assumed  to  have  its  own 
private  address  space  and  non-local  memory  is  accessed  via  explicit  message  passing. 
Letting  P  =  p^  he  the  total  number  of  processors  to  be  employed,  we  assume  that  the 
subdomains  can  be  integrally  mapped  onto  the  processors  in  each  direction,  implying 
that  q  =  mp,  m  a  positive  integer.  The  discretization  on  each  processor  therefore 
consists  of  an  array  of  blocks,  with  each  block  containing  an  nth-order  spectral 
discretization  based,  e.g.,  on  tensor-product  Chebyshev  polynomials  of  degree  n. 

We  assume  that  the  communication  time  is  not  “covered”  by  computation,  i.e., 
that  arithmetic  operations  do  not  take  place  simultaneously  with  communication. 
The  F-processor  solution  time  Tp  is  therefore  the  sum 

Tp  =  r„  +  Te ,  (5) 

where  Ta  and  Tc  denote  the  arithmetic  and  communication  components,  respectively. 
The  optimal  number  of  subdomains  is  determined  by  minimizing  Tp  subject  to  the 
error  tolerance  constraint  (4). 

Advancement  of  the  solution  (2)  requires  several  vector-vector  updates,  each 
with  an  operation  count  0{QN)  =  Oiq'^n^).  In  addition,  the  discrete  divergence 
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operator  requires  one  derivative  evaluation  in  each  of  d  spatial  directions.  Because  of 
the  tensor  product  form  of  the  bases,  the  operation  count  for  each  nth-order  derivative 
evaluation  scales  as  0{dq^n'^'^^)  for  matrix-matrix  product  based  differentiation,  and 
0{dq'^n‘^  log  n)  for  fast-transform  approaches.  We  denote  the  combined  work  estimate 
as: 


Afops  =  A  , 


with  0  <  7  <  1.  Here,  A{d)  is  a  small  parameter  proportional  to  but  independent  of 
q  and  n.  The  non-integer  exponent  7  provides  the  flexibility  to  account  for  the  lower 
order  terms  in  the  operation  count  and  for  the  fact  that,  even  in  the  matrix-matrix 
product  case,  the  time  for  the  leading  order  term  will  typically  scale  less  rapidly  than 
the  operation  count  as  vector  performance  generally  improves  with  increasing 

n. 

Aside  from  the  interface  data  exchanges,  full  parallelism  is  attained  for  the  vector 
update  (2),  yielding  a  per-step  arithmetic  time  complexity  of: 
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qd^d+7 

P 


ta 


(6) 


where  4  is  the  (average)  time  required  for  a  single  arithmetic  operation  such  as 
multiplication  or  addition.  For  matrix-matrix  product  based  differentiation,  ta  will 
be  close  to  the  processor  clock-cycle  time,  as  good  use  is  made  of  local  memory 
hierarchies  (i.e.,  cache).  For  fast-transform  approaches,  ta  will  be  significantly  larger. 
However,  for  sufficiently  large  n,  this  is  clearly  balanced  by  the  C>(nlogn)  complexity. 

In  the  multi-processor  implementation,  communication  is  required  at  each  time- 
step  to  update  edge  values  shared  by  adjacent  subdomains.  For  most  modern  message 
passing  parallel  computers,  an  appropriate  model  of  contention-free  communication 
is  given  by  the  linear  function: 


tcomm^.  —  A  ^w'jta  • 


Here,  tcomm[w]  is  the  time  required  to  transmit  an  w-word  message  from  one  processor 
to  another,  a  and  fd  are  non-dimensional  constants  representing  message  start-up  time 
(latency)  and  asymptotic  per-word  transfer  time,  respectively.  In  appropriate  units,  /? 
is  the  inverse  of  the  communication  bandwidth.  For  the  d-dimensional  model  problem, 
each  processor  transmits 


words  per  step,  where  d  <  B  <  2d  is  a  parameter  which  is  dependent  on  the  prob¬ 
lem  and  possibly  varying  from  one  subdomain  to  the  next.  If  the  local  convection  is 
predominantly  in  one  direction,  information  will  flow  from  only  d  faces  of  each  sub- 
domain,  rather  than  from  each  of  the  2d  faces  present  on  each  subdomain.  We  will 
consider  the  suboptimal  case  where  information  must  propagate  in  each  direction. 

In  addition  to  the  face  data,  there  is  also  . . .  0(1)  edge  and  vertex  data 

which  must  be  communicated.  Although  of  a  lower-order  than  the  principal  face  data 
which,  these  terms  cannot  be  ignored  because  latency  effects  keep  communication  time 
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from  going  to  zero  as  the  message  size  decreases.  However,  we  do  not  directly  account 
for  these  lower  order  terms  in  the  present  model  for  the  following  reasons.  First,  for 
a  d-dimensional  tensor  product  geometry  such  as  employed  here,  it  is  possible  to 
properly  exchange  the  lower-order  values  in  the  course  of  exchanging  the  2d  faces 
without  any  extra  communication  (see,  e.g.,  [5]).  Second,  in  cases  where  the  mesh 
topology  does  not  permit  such  efficiencies,  it  is  unlikely  that  the  choice  of  the  optimal 
{q,  n)  pair  will  be  strongly  influenced  by  the  edge  exchanges;  they  generally  comprise 
short  messages  dominated  by  latency  costs  which  are  independent  of  {q,n).  We 
reconsider  this  point  in  the  results  section. 

Denoting  the  total  communication  cost  as  Tc,  we  have: 

Tc  =  il-6ip){T^  +  T0)  , 

where  the  Kronecker  delta  term,  (1  -  dip)  accounts  for  the  absence  of  communication 
in  the  uni-processor  case.  Here, 

Ta  =  Bata  (7) 


is  the  latency  term,  and 


Ta  =  B 


{qn) 


d-l 
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d-l 


-Pta 


(8) 


is  the  bandwidth  component. 

Combining  (6-8),  the  P-processor  solution  time  is: 


Tp  —  ta 
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pd 
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(9) 


It  is  clear  that,  for  fixed  problem  size,  {q,n),  latency  becomes  the  dominant  factor 
in  loss  of  parallel  efficiency  as  p  — >  CX3.  However,  the  latency  term  has  no  (g,  n)- 
dependence  and  therefore  does  not  influence  the  optimal  discretization  choice. 


4  Cost  Analysis 


We  now  consider  the  problem  of  finding  a  discretization  pair  (g,  n)  which  minimizes 
Tp  subject  to  the  error  tolerance  constraint  (4).  Clearly,  since  is  independent  of  g 
and  n,  we  need  only  consider  the  sum  of  Ta  and  T^. 

The  dimension  of  the  parameter  space  (g,  n)  can  be  reduced  from  two  to  one  by 
using  the  constraint  (4)  to  solve  for  qn  in  terms  of  n: 


eirk  i. 

qn  =  —en  . 


(10) 


Substituting  this  into  (6)  and  (8): 


Ta 

Tp 


taA 


taPB 


(11) 

(12) 
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Differentiating  with  respect  to  n  leads  to 

d  jn  -  ed 

j  a  n  CL 

an 

(13) 

d  (1  -  d)e 

(14) 

Equating  the  sum  of  (13)  and  (14)  to  zero,  we  find: 

nopt—  ~  d  +  {1  —  5ip){d  —  1)-;^  , 

(15) 

71 — Tlopt 


where  the  Kronecker  delta  term  has  been  incorporated  to  reflect  the  uni-processor 
case. 

Note  that  (15)  is  an  implicit  equation  for  n^f,  as  Tq  and  are  dependent 
on  n.  However,  for  the  parameters  under  consideration,  the  ratio  Tp/Ta  is  not  a 
strong  function  of  n,  and  a  fixed-point  iteration  in  the  neighborhood  of  n  Uopt 
typically  converges  in  one  or  two  iterations.  Consequently,  the  structure  of  (15) 
reveals  much  of  the  expected  behavior  of  Uopt-  For  example,  when  P  is  small  or  d  =  1, 
the  communication  cost  is  negligible  and  we  recover  the  optimal  order  for  the  serial 
case,  =  f-  In  this  case,  the  optimal  order  increases  (and  correspondingly,  the 
number  of  subdomains  decreases)  with  increasing  spatial  dimension,  d,  decreasing 
error  tolerance,  e~%  or  decreasing  spectral  overhead  cost,  7.  It  is  interesting  to  note 
that  under  these  circumstances,  the  optimal  choice  of  n  is  independent  of  k,  and  from 
(4)  we  conclude  that  the  optimal  number  of  subdomains,  qopt,  is  proportional  to  the 
maximum  resolvable  wave-number,  k. 

In  the  general  case,  an  explicit  formula  for  Uopt  can  be  derived  by  setting  z  = 
and  substituting  into  (15).  Using  (11-12),  we  can  derive  an  equation  in  which  the 
z-dependence  is  explicit: 


/(^) 


d-ipB  2p 
7  A  enk 


(16) 


It  can  be  shown  that  f{z)  is  monotonically  increasing,  from  which  we  conclude  that 
there  is  an  increase  in  Uopt  whenever  the  right  hand  side  of  (16)  increases.  Thus,  an 
increasing  number  of  processors,  P  —  or  increasing  communication  cost,  /3f ,  leads 
to  an  increase  in  the  optimal  order  of  approximation,  and  corresponding  decrease  in 
the  optimal  number  of  subdomains,  subject  to  the  constraint  that  both  be  integers. 


5  Results 

In  this  section,  we  examine  the  parameters  which  influence  (qopt,?^opt)  in  a  multi¬ 
processor  implementation.  Both  two-  and  three-dimensional  problems  are  considered. 


Table  1:  Problem,  algorithm,  and  hardware  parameters. 
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with  maximuni  resolvabl6  wave  numbers  of  k  =  32,  and  16,  respectively.  These, 
along  with  the  specified  error  tolerance,  specify  the  characteristics  of  the  physical 
problem.  The  algorithmic  characteristics  have  been  selected  to  reflect  a  matrix-matrix 
product  based  approach  to  differentiation,  i.e.,  with  a  fairly  low  constant  A  and 
relatively  high  exponent  7.  The  multipier  d  in  the  communication  term  B  reflects 
the  fact  that  information  is  propagated  in  each  of  d  spatial  dimensions  whenever  the 
linear  convection  operator  is  applied.  The  hardware  parameters  are  representative  of 
dedicated  parallel  architectures.  The  nondimensional  communication  characteristics 
a  and  p  are  derived  from  tests  described  in  Appendix  A.  For  networks  of  workstations, 
the  latency  term,  a,  would  generally  be  much  higher.  While  this  impacts  the  overall 
parallel  efficiency,  it  would  not  impact  the  optimal  discretization,  as  noted  earlier. 
All  the  results  scale  directly  with  the  floating  point  cycle  time  ta-  However,  to  give 
the  times  a  realistic  scale,  ta  is  set  to  lO"*  seconds,  corresponding  to  a  nominal 
performance  of  100  MFLOPS  per  node.  Table  1  summarizes  the  baseline  parameters. 

The  impact  of  communication  overhead  is  illustrated  in  Table  2.  Values  of 
(gopt,nopt)  computed  from  (15)  and  (10),  along  with  model  estimates  of  time  per 
step,  Tp,  are  shown  for  P  =  1  to  4096.  In  order  to  highlight  the  slight  variations  in 
(gopt,  nopt)  the  values  have  not  been  rounded  up  to  the  nearest  integer.  In  addition, 
we  have  momentarily  relaxed  the  constraint  g  =  mp;  the  impact  of  granularity  is 
examined  below. 

The  results  of  Table  2  indicate  that  communication  overhead  has  strikingly 
little  effect  on  the  value  of  q^pt-  The  case  {d  =  2,  e"*  =  lO'^)  is  the  only  one  which 
exhibits  notable  (20  %)  variation  in  qopt-  Even  if  the  communication/computation 
ratio  is  artificially  increased  by  halving  7?  or  doubling  /?,  the  results  are  little  changed. 
Note  that  if  it  is  necessary  to  communicate  data  on  the  edges  in  addition  to  the  faces, 
an  approximate  bound  on  the  number  of  words  transferred  is  36^  in  IR  (twelve  edges 
of  length  ^  to  three  processors  each).  For  optimal  values  of  (9,  n),  this  is  significantly 

below  the  amount  of  face  data  transferred  (6  (^)^)  and  we  conclude  that  the  edge 
data  has  little  influence  on  the  optimal  discretization.  Finally,  note  that  increasing 
the  physical  complexity  by  increasing  k  simply  causes  a  proportional  increase  in  q^pu 
and  no  change  in  n^f 

It  is  useful  to  examine  the  sensitivity  of  solution  time  to  the  choice  of  (9,n). 
Fig.  1  shows  the  normalized  time-per-step,  PTp,  versus  n  for  error  criteria  10"^, 
10"®,  and  10“^,  with  P  =  1,4, 16,64,  and  256.  These  results  were  computed  using 
(9)  in  conjunction  with  (10).  In  this  case,  a  was  set  to  zero,  simply  to  allow  the 
multi-processor  results  to  collapse  onto  a  single  curve,  and  to  highlight  the  influence 
of  the  remaining  communication  term.  Also  shown  are  the  values  of  PTp  when  q  is 
restricted  to  be  an  integer,  i.e.. 


g  = 


e'Kk 

2n 


(17) 


These  are  seen  in  Fig.  1  as  the  jagged  counterparts  to  the  smooth  curves  derived 
directly  from  (9-10).  The  fact  that  the  multi-processor  curves  lie  on  top  of  one 
another  implies  near  unity  efficiency,  i.e.,  hUPtp)  ^  1,  which  would  be  the  case  if 
the  latency  were  in  fact  zero.  The  most  striking  feature  of  Fig.  1  is  that  the  continuous 
{q,  n)|g  curves  exhibit  a  broad  minimum,  especially  for  smaller  error  tolerance,  error  = 
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e“^  By  contrast,  the  time  for  the  discrete  {q,n)  pairs  exhibits  large  fluctuations, 
particularly  as  q  nears  unity.  To  exploit  the  broad  minima  of  the  continuous  results, 
it  is  best  to  choose  q  >  qopt,  with  appropriate  n,  in  order  to  minimize  deviation  from 
the  optimal  solution  times. 

The  results  presented  so  far  appear  to  indicate  that  parallelism  has  little  influence 
on  the  discretization  pair  {q,  n).  In  fact,  a  fundamental  constraint  which  has  not  been 
imposed  in  the  previous  model  problems  is  that  the  number  of  subdomains,  Q,  be 
an  integral  multiple  of  the  number  of  processors,  P.  The  combined  influences  of  this 
imposed  granularity,  nonzero  latency,  and  suboptimal  {q,  n)  pairs  are  shown  in  Fig. 
2.  The  solution  time  versus  number  of  processors  is  plotted  for  the  two-dimensional 
baseline  case  (Table  1)  with  error  criteria  =  10“^  and  10“®.  Four  values  of  n  are 
considered:  a  fixed  (arbitrary)  value  of  n  =  4,  the  optimal  serial  case,  n  =  Uopts,  the 
optimal  parallel  case,  n  =  riopt,  and  the  value  of  n  obtained  iiq  =  p,  i.e.,  corresponding 
to  one  subdomain  per  processor. 

For  this  problem,  the  low-order  (n  =  4)  case  requires  significantly  more  time  than 
the  other  cases.  The  optimal  curves,  which  are  essentially  indistinguishable,  exhibit 
significant  deviation  from  linear  speed  up  for  P  >  256,  due  to  latency.  However,  for 
P  >  64,  the  optimal  number  of  subdomains,  qopt,  is  less  than  p,  implying  that  the 
analytically  derived  optimal  performance  cannot  be  obtained  in  this  regime.  Instead, 
the  number  of  subdomains  must  be  set  equal  to  P,  and  the  performance  must  track 
the  q  =  p  >  qopt  curve.  Thus,  imposed  granularity  is  an  additional  contributor 
to  inefficiency.  We  note  that,  in  practice,  this  source  of  inefficiency  might  not  be 
encountered  since  it  is  common  to  increase  the  number  of  processors  with  increasing 
physical  complexity  and  problem  size,  resulting  in  “scaled”  speed-up,  as  noted  in  [8]. 
Since  qopt  scales  almost  directly  with  physical  complexity  (k),  it  should  be  possible  to 
maintain  q  w  qopt  in  the  scaled  speed-up  case. 


Table  2:  Optimal  discretization  pairs  and  time-per-step  (s) 


e-^  =  10-^ 

II 

1 

,-6 

d 

k 

P 

P 

Qopt 

'^opt 

Tp 

Qopt 

'^opt 

Tp 

2 

32 

1 

1 

11.9 

12.3 

2.2x10"^ 

4.0 

36.8 

5.1x10-2 

2 

32 

4 

2 

11.8 

12.3 

5.7x10"^ 

4.0 

36.9 

1.3x10-2 

2 

32 

16 

4 

11.7 

12.4 

1.5x10"^ 

3.9 

37.0 

3.3x10-3 

2 

32 

64 

8 

11.6 

12.5 

4.1x10-'* 

3.9 

37.2 

8.5x10"^ 

2 

32 

256 

16 

11.3 

12.8 

1.3x10-^ 

3.9 

37.5 

2.5x10-^ 

2 

32 

1024 

32 

10.7 

13.3 

6.6x10-^ 

3.8 

38.2 

9.4x10-3 

2 

32 

4096 

64 

9.8 

14.2 

4.7x10“^ 

3.6 

39.5 

5.4x10-3 

3 

16 

1 

1 

3.5 

18.4 

8.6x10"* 

1.2 

55.3 

2.0 

3 

16 

8 

2 

3.5 

18.6 

1.1  xlO"* 

1.2 

55.4 

2.5x10"* 

3 

16 

64 

4 

3.4 

18.7 

1.4x10-2 

1.2 

55.6 

3.1x10-2 

3 

16 

512 

8 

3.4 

19.0 

1.8x10-3 

1.1 

56.0 

4.0x10-3 

3 

16 

4096 

16 

3.3 

19.5 

2.9x10-^ 

1.1 

56.7 

5.6x10"^ 
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6  Conclusions 

Analytical  error  estimates  derived  in  [1]  have  been  used  in  conjunction  with  com¬ 
putational  complexity  estimates  for  parallel  spectral  subdomain  algorithms  to  derive 
optimal  discretization  parameters  for  time-explicit  numerical  solution  of  hyperbolic 
partial  differential  equations.  Communication  overhead  increases  the  optimal  ap¬ 
proximation  order,  n>opt^  over  that  derived  for  the  serial  case.  However,  for  realistic 
multicomputer  parameters,  the  change  is  typically  small.  Moreover,  because  the  opti¬ 
mal  solution  time  exhibits  a  broad  minimum  about  rzopt,  while  the  achievable  solution 
time  (due  to  integral  constraints  on  q  and  n)  exhibits  large  amplitude  fluctuations 
for  large  n,  there  is  incentive  to  choose  n  <  Uopt-  The  most  significant  impact  of 
parallelism  upon  the  choice  of  {q,  n)  is  that  it  may  potentially  impose  a  granularity 
upon  the  discretization  which  is  suboptimal,  i.e,  q  =  p>  qopt- 


n 


Figure  1;  Normalized  time  per  step  for  two-dimensional  problem  with  a  =  0. 
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Appendix  A:  Communication  parameters 

We  assume  that  contention  free  messages  consisting  of  m  words  have  a  point-to-point 
travel  time  given  by: 


^commi^)  —  "h  (3  ni)ta  ) 

The  model  does  not  account  for  distance  between  processors.  However,  for  most 
modern  message  passing  systems,  any  fluctuation  in  latency  due  to  greater  wire  sep¬ 
aration  between  processors  is  so  overwhelmed  by  other  sources  of  latency  as  to  be 
immeasurable. 

The  values  of  a  and  /?  are  measured  via  the  following  point-to-point  or  “ping- 
pong”  test.  For  each  processor  pair,  (0,  /c),  =  2*  - 1,  t  =  1, . . . ,  log2  P,  the  processors 

synchronize  and  commence  timing.  Processor  0  then  sends  an  m-word  message  to 
processor  k,  and  immediately  posts  a  blocking  receive.  Processor  k  posts  a  blocking 
receive,  and  upon  receipt  of  the  message  from  node  0,  sends  an  m-word  segment  of  a 
different  array  back  to  0.  This  continues  for  fifty  iterations,  each  message  sent  from 
and  placed  into  different  memory  locations  in  order  to  avoid  unduly  favorable  cache 
behavior.  Fig.  A  shows  the  measured  round-trip  message  times,  =  2  {a+Pm)ta,  for 
the  512-node  Intel  Delta  at  Caltech,  the  340  node  Intel  Paragon  at  Wright-Patterson 
AFB,  and  the  20-node  IBM  SP2  at  Brown  University  using  the  mpich  message  passing 
library  from  Argonne  National  Labs. 

Estimates  of  ta  are  derived  from  mflops  measurements  for  computation/memory- 
access  patterns  which  are  typical  of  the  computations  under  consideration.  In  this 
case,  the  work  is  dominated  by  matrix-matrix  products  of  order  n,  which  have  favor¬ 
able  caching  patterns  but  generally  do  not  attain  near  peak-performance  unless  the 
matrix  order  is  quite  large,  i.e.,  n  «  30  or  greater. 

The  measured  values  of  ta,  octa  and  f5ta  are  shown  in  Table  A,  along  with  derived 


Figure  2:  Time  per  step  for  two-dimensional  problem  with  truncation  error  of  e  *  = 
10"^  (left)  and  10“®  (right). 
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estimates  for  a  and  /?.  Also  listed  is  m2  =  o;//3,  the  characteristic  message  size 
distinguishing  short  (latency-dominated)  from  long  (bandwidth-limited)  messages. 


Table  A:  Machine  dependent  parameters  64-bit  arithmetic 


Machine 

ta  (s) 

ata  (s) 

Pta  (s) 

a 

P 

m2 

Delta 

1.0x10"'^ 

9.X10-5 

1.1x10-6 

900 

11 

80 

Paragon 

.66x10"^ 

5.X10-5 

.15x10-6 

760 

2.3 

330 

SP2 

.20x10"^ 

6.X10-5 

.27x10-6 

3000 

15 

200 
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